DRAFT: Parquet 3 metadata with decoupled column metadata #242
Closed
pitrou wants to merge 1 commit into apache:master from
Parquet 3 metadata proposal
This is a very rough attempt at solving the problem of FileMetadata footprint and decoding cost, especially for Parquet files with many columns (think tens of thousands of columns).
Context
This is in the context of the broader "Parquet v3" discussion on the mailing list. A number of possible far-reaching changes are being collected in a document.
It is highly recommended that you read at least that document before commenting on this PR.
Specifically, some users would like to use Parquet files for data with tens of thousands of columns, and potentially hundreds or thousands of row groups. Reading the file-level metadata for such a file is prohibitively expensive given the current file structure where all column-level metadata is eagerly decoded as part of file-level metadata.
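To give a rough sense of scale, here is a small back-of-the-envelope sketch. It only counts metadata entries, not bytes or decode time. The function names and the example sizes are made up for illustration, and the "decoupled" count is an assumption about a layout where the file-level metadata holds one entry per column plus one per row group.

```python
def eager_entries(n_columns: int, n_row_groups: int) -> int:
    # Current structure: every row group carries one column-chunk
    # metadata entry per column, and all of them are decoded together
    # with the file-level metadata.
    return n_columns * n_row_groups


def decoupled_entries(n_columns: int, n_row_groups: int) -> int:
    # Assumed decoupled layout (illustrative only): one entry per column
    # plus one entry per row group at the file level.
    return n_columns + n_row_groups


# Illustrative sizes only, not measurements from a real file.
n_columns, n_row_groups = 20_000, 1_000

print(eager_entries(n_columns, n_row_groups))      # 20000000
print(decoupled_entries(n_columns, n_row_groups))  # 21000
```

With these made-up sizes, a reader interested in a handful of columns still has to walk through 20 million entries in the eager case, which is the cost this proposal is trying to avoid.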
Contents
It includes a bunch of changes:
O(n_columns + n_row_groups) instead of O(n_columns * n_row_groups)
Jira
Commits
Documentation